Last month I began my
retrospective review of Codd's first two relational papers -- "Derivability,
Redundancy, and Consistency of Relations Stored in Large Data Banks" (IBM Research
Report RJ599, August 19, 1969) and "A Relational Model of Data for Large Shared
Data Banks" (CACM 13, June 1970). In particular, I took a detailed look at the
first section of the first paper. Just to remind you, that paper had six sections overall:
1. A Relational View of Data
2. Some Linguistic Aspects
3. Operations on Relations
4. Expressible, Named, and Stored Relations
5. Derivability, Redundancy, and Consistency
6. Data Bank Control
Some Linguistic Aspects
Codd opened this section with the following crucial observation: "The adoption of
a relational view of data ... permits the development of a universal retrieval sublanguage
based on the second-order predicate calculus." (Note the phrase
"second-order;" the 1969 paper explicitly permitted relations to be defined on
domains having relations as elements. I'll come back to this point when I discuss the 1970
paper in detail.)
It was Codd's very great insight that a database could be thought of as a set of
relations, that a relation in turn could be thought of as a set of propositions (assumed
by convention to be true), and hence, that the language and concepts of logic could
be directly applied to the problem of data access and related problems. In this section of
the paper, he sketched the salient features of an access language based on such concepts.
These features include the following: The language would be set level, and the emphasis
would be on data retrieval (though update operations would also be included). Also, the
language would not be computationally complete; it was meant to be a
"sublanguage," to be "[embedded] in a variety of host languages.... Any
[computational] functions needed can be defined in [the host language] and invoked [from
the sublanguage]." Personally, I've never been entirely convinced that factoring out
data access into a separate "sublanguage" was a good idea, but it's certainly
been with us (in the shape of embedded SQL) for a good while now. In this connection,
incidentally, it's interesting to note that with the addition in 1996 of the PSM feature
(Persistent Stored Modules) to the SQL standard, SQL has now become a computationally
complete language in its own right, meaning that a host language is no longer logically
necessary (with SQL, that is).
Codd also wrote, "Some deletions may be triggered by others if deletion
dependencies ... are declared." In other words, Codd already had in mind in 1969 the
possibility of triggered "referential actions" such as CASCADE DELETE
(and in the 1970 paper, he extended this notion to include UPDATE referential
actions as well). Also, the language would provide symmetric exploitation. That is,
the user would be able to access a given relation using any
combination of its attributes as knowns and the remaining ones as unknowns. As Codd
observed, "This is a system feature missing from many current information systems." Quite
right. But we take it as a sine qua non now, at least in the relational world. (The
object world doesn't seem to think it's so important, for some reason.)
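Symmetric exploitation is easy to illustrate. Here's a minimal Python sketch, not anything from Codd's papers; the relation, its attributes, and the `match` helper are all my own invention. Any subset of the attributes can serve as the knowns:

```python
def match(heading, body, **knowns):
    """Symmetric exploitation: treat any attributes as knowns and
    return the rows agreeing with them (the rest are the unknowns)."""
    return {row for row in body
            if all(row[heading.index(a)] == v for a, v in knowns.items())}

# Hypothetical EMP relation -- illustration only
heading = ("EMP", "DEPT", "CITY")
body = {("e1", "d1", "London"), ("e2", "d1", "Paris"), ("e3", "d2", "Paris")}

match(heading, body, DEPT="d1")                # EMP and CITY are the unknowns
match(heading, body, CITY="Paris", DEPT="d2")  # EMP is the only unknown
```

The point is that no one attribute combination is privileged as "the" search key; the same relation answers queries phrased from any direction.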
Operations on Relations
This section of the paper provides definitions of certain relational operations; in
other words, it describes what later came to be called the manipulative part of the
relational model. Before getting into the definitions, however, Codd states: "Most
users would not be directly concerned with these operations. Information systems
designers and people concerned with data bank control should, however, be thoroughly
familiar with [them]." (Italics added.) How true! In my experience, regrettably,
people who should be thoroughly familiar with these operations are all too often
not so.
The operations Codd defines are permutation, projection, join, tie, and composition
(the 1970 paper added restriction, which I'll cover here for convenience). It's
interesting to note that the definitions for restriction and join are rather
different from those usually given today and that two of the operations, tie and composition,
are now rarely considered.
Throughout what follows, the symbols X, Y, ... (and so on) denote either
individual attributes or attribute combinations, as necessary. Also, I'll treat the
definition of join at the end, for reasons that will become clear in a moment.
Permutation. Reorder the attributes of a relation, left to right. (As I noted
last month, relations in the 1969 paper had a left-to-right ordering to their attributes.
By contrast, the 1970 paper states that permutation is intended purely for internal use
because the left-to-right ordering of attributes is -- or should be -- irrelevant so far
as the user is concerned.)
Projection. More or less as understood today (although the syntax is different;
in what follows, I'll use the syntax R{X} to denote the projection of R over X). Note: The
name "projection" derives from the fact that a relation of degree n can be
regarded as representing points in n-dimensional space, and projecting that relation over
m of its attributes (m<n) can be seen as projecting those points onto
the corresponding m axes.
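In modern terms, projection both drops attributes and eliminates any duplicate rows that result. A minimal sketch in Python, representing a relation as a heading plus a set of tuples (the relation and its attribute names are mine, purely for illustration):

```python
def project(heading, body, attrs):
    """The projection R{attrs}: keep only the named attributes and,
    because the body is a set, eliminate any duplicate rows."""
    idx = [heading.index(a) for a in attrs]
    return attrs, {tuple(row[i] for i in idx) for row in body}

# Hypothetical supplier relation S{S#, CITY, STATUS}
heading = ("S#", "CITY", "STATUS")
body = {("S1", "London", 20), ("S2", "Paris", 10), ("S3", "Paris", 30)}

_, cities = project(heading, body, ("CITY",))
# cities has two rows, not three: the duplicate Paris row has vanished
```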
Tie. Given a relation A{X1,X2,...,Xn}, the tie of A is the
restriction of A to just those rows in which A.Xn = A.X1 (using
"restriction" in its modern sense, not in the special sense defined below).
Composition. Given relations A{X,Y} and B{Y,Z}, the composition of
A with B is the projection on X and Z of a join of A
with B (the reason I say "a" join, not "the" join, is explained
below). Note: The natural composition is the projection on X and Z
of the natural join.
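The natural composition, then, is a join followed by a projection that throws the joining attributes away. A sketch for binary operands A{X,Y} and B{Y,Z} (an illustration under my own toy data, not Codd's formalism):

```python
def natural_compose(a, b):
    """Natural composition of A{X,Y} with B{Y,Z}: the projection
    on X and Z of the natural join of A and B."""
    return {(x, z) for (x, y) in a for (y2, z) in b if y == y2}

# Hypothetical: employees to departments, departments to cities
a = {("e1", "d1"), ("e2", "d2")}
b = {("d1", "London"), ("d2", "Paris")}
natural_compose(a, b)  # employees paired directly with cities; DEPT is gone
```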
Restriction. Given relations A{X,Y} and B{Y}, the restriction of A
by B is defined to be the maximal subset of A such that A{Y} is a
subset -- not necessarily a proper subset -- of B.
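In today's terminology this operation is essentially a semijoin: it keeps each row of A whose Y-value appears in B. A sketch (representation and names as in the earlier examples, i.e., my own):

```python
def restrict(a_heading, a_body, y_attrs, b_body):
    """Codd's 1969 restriction of A by B{Y}: the maximal subset of A
    whose projection over Y is a subset of B."""
    idx = [a_heading.index(a) for a in y_attrs]
    return {row for row in a_body if tuple(row[i] for i in idx) in b_body}

a_heading = ("EMP", "DEPT")
a_body = {("e1", "d1"), ("e2", "d2"), ("e3", "d1")}
good_depts = {("d1",)}
restrict(a_heading, a_body, ("DEPT",), good_depts)  # keeps e1 and e3 only
```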
Codd also says "all of the usual set operations are [also] applicable ... [but]
the result may not be a relation." In other words, definitions of the specifically
relational versions of Cartesian product, union, intersection, and difference still lay in
the future when Codd was writing his 1969 paper.
Now let's get to the definition of join. Given relations A{X,Y} and B{Y,Z},
the paper defines a join of A with B to be any relation C{X,Y,Z} such
that C{X,Y} = A and C{Y,Z} = B. Note, therefore, that A and B
can be joined (or "are joinable") only if their projections on Y are
identical -- that is, only if A{Y} = B{Y}, a condition one might have thought
unlikely to be satisfied in practice. Also note that if A and B are
joinable, then many different joins can exist (in general). The well-known natural join --
called the linear natural join in the paper, in order to distinguish it from another kind
called a cyclic join -- is an important special case, but it's not the only possibility.
Oddly, however, the definition given in the paper for the natural join operation
doesn't require A and B to be joinable in the foregoing special sense! In
fact, that definition is more or less the same as the one we use today.
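Both notions can be sketched in a few lines of Python for binary operands A{X,Y} and B{Y,Z} (again an illustration with invented data, not Codd's formalism):

```python
def joinable(a, b):
    """Codd's 1969 joinability condition: A{Y} = B{Y}."""
    return {y for (_, y) in a} == {y for (y, _) in b}

def natural_join(a, b):
    """The natural join as we define it today -- no joinability required."""
    return {(x, y, z) for (x, y) in a for (y2, z) in b if y == y2}

a = {("x1", "y1"), ("x2", "y2")}
b = {("y1", "z1"), ("y2", "z2")}

assert joinable(a, b)
c = natural_join(a, b)
# Joinability guarantees the join loses nothing: projecting the result
# over {X,Y} and over {Y,Z} gives back both operands exactly
assert {(x, y) for (x, y, _) in c} == a
assert {(y, z) for (_, y, z) in c} == b
```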
Let me try to explain where that rather restrictive "joinability" notion
comes from. Codd begins his discussion of joins by asking the important question: Under
what circumstances does the join of two relations preserve all the information in those
two relations? And he shows that the property of "joinability" is sufficient to
ensure that all information is thus preserved (because no row of either operand is lost in
the join). Further, he also shows that if A and B are "joinable"
and either A.X is functionally dependent on A.Y or B.Z is
functionally dependent on B.Y, then the natural join is the only join possible
(though he doesn't actually use the functional dependence terminology -- that also lay in
the future). In other words, what Codd is doing here is laying some groundwork for the
all-important theory of nonloss decomposition (which, of course, he elaborated on
in subsequent papers).
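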
Remarkably, Codd also gives an example that shows he was aware back in 1969 of the fact
that some relations can't be nonloss-decomposed into two projections but can be
nonloss-decomposed into three! This example was apparently overlooked by most of the
paper's original readers; at any rate, it seemed to come as a surprise to the research
community when that same fact was rediscovered several years later. Indeed, it was that
rediscovery that led to Ronald Fagin's invention of the "ultimate" normal form,
5NF, also known as projection-join normal form (PJNF).
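Codd's own example isn't reproduced here, but the phenomenon is easy to demonstrate with the SPJ-style relation that later became standard in the normalization literature (the attribute values below are mine): no two of the binary projections join back to the original, yet the join of all three recovers it exactly.

```python
SPJ = {("s1", "p1", "j2"), ("s1", "p2", "j1"),
       ("s2", "p1", "j1"), ("s1", "p1", "j1")}

# The three binary projections
SP = {(s, p) for (s, p, j) in SPJ}
PJ = {(p, j) for (s, p, j) in SPJ}
JS = {(j, s) for (s, p, j) in SPJ}

# Joining two projections adds a spurious tuple ...
two_way = {(s, p, j) for (s, p) in SP for (p2, j) in PJ if p == p2}
assert ("s2", "p1", "j2") in two_way and two_way != SPJ

# ... but joining in the third projection eliminates it: nonloss after all
three_way = {(s, p, j) for (s, p, j) in two_way if (j, s) in JS}
assert three_way == SPJ
```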
Expressible, Named, and Stored Relations
According to Codd, three collections of relations are associated with a data bank: expressible,
named, and stored sets. An expressible relation can be designated by
means of an expression of the data access language (which is assumed to support the
operations described in the previous section); a named relation has a user-known
name; and a stored relation is directly represented in physical storage somehow.
I do have a small complaint here (with 20/20 hindsight, once again): It seems a little
unfortunate that Codd used the term stored relation the way he did. Personally, I
would have divided the expressible relations into two kinds, base relations and derivable
ones; I would have defined a derivable relation to be an expressible one the value of
which, at all times, is derived according to some relational expression from other
expressible relations, and a base relation to be an expressible relation that's not
derivable in this sense. In other words, the base relations are the "given"
ones; the derivable ones are the rest. And then I would have made it very clear that base
and stored relations are not necessarily the same thing. (See Figure 1.) As it is, the
paper effectively suggests that base and stored relations are the same thing
(basically because it doesn't even bother to mention base relations as a separate category
at all).
It's true that base relations are essentially the same as stored relations in most SQL
products today. In other words, most people think of base relations as mapping very
directly to physical storage in those products. But there's no logical requirement for
that mapping to be so simple; indeed, the distinction between model and implementation
dictates that we say nothing about physical storage at all in the model. To be more
specific, the degree of variation allowed between base and stored relations should be at
least as great as that allowed between views and base relations; the only logical
requirement is that it must be possible to obtain the base relations somehow from those
that are physically stored (and then the derivable ones can be obtained, too). As I
already indicated, however, most products today provide very little
support for this idea; that is, most products today provide much less data independence
than relational technology is theoretically capable of. And this fact is precisely
why we run into the notorious denormalization issue. Of course, denormalization is
sometimes necessary (for performance reasons), but it should be done at the physical
storage level, not at the logical or base relation level. Because most systems today
essentially equate stored and base relations, however, there is much confusion over this
simple point; furthermore, denormalization usually has the side effect of corrupting an
otherwise clean logical design, with well-known undesirable consequences.
Enough of this griping. Codd goes on to say, "If the traffic on some unnamed but
expressible relation grows to significant proportions, then it should be given a name and
thereby included in the named set." In other words, make it a view! So Codd was
already talking about the idea of views as "canned queries" back in 1969.
"Decisions regarding which relations belong in the named set are based ... on the
logical needs of the community of users, and particularly on the ever-increasing
investment in programs using relations by name as a result of past membership ... in the
named set." Here Codd is saying that views are the mechanism for providing logical
data independence -- in particular, the mechanism for ensuring that old programs
continue to run as the database evolves. And he continues, "On the other hand,
decisions regarding which relations belong in the stored set are based ... on ...
performance requirements ... and changes that take place in these [requirements]."
Here Codd is drawing a very sharp distinction between the logical and physical
levels.
Derivability, Redundancy, and Consistency
In this section, Codd begins to address some of the issues that later came to be
included in the integrity part of the relational model. A relation is said to be derivable
if and only if it's expressible in the sense of the previous section. (Note that
this definition of derivability is not quite the same as the one I was advocating above
because -- at least tacitly -- it includes the base relations.) A set of relations is then
said to be strongly redundant if it includes at least one relation that's derivable
(in Codd's sense) from other relations in the set.
The 1970 paper refines this definition slightly, as follows: A set of relations is strongly
redundant if it includes at least one relation that has a projection -- possibly the identity
projection, meaning the projection over all attributes -- that's derivable from other
relations in the set. (I've taken the liberty of simplifying Codd's definition somewhat,
although, of course, I've tried to preserve his original intent.)
Codd then observes that the named relations probably will be strongly redundant
in this sense, because they'll probably include both base relations and views derived from
those base relations. (What the paper actually says is that "[such redundancy] may be
employed to improve accessibility of certain kinds of information which happen to be in
great demand;" this is one way of saying that views are a useful shorthand.) However,
the stored relations will usually not be strongly redundant. Codd elaborates on
this point in the 1970 paper: "If ... strong redundancies in the named set are
directly reflected ... in the stored set ... then, generally speaking, extra storage space
and update time are consumed [though there might also be] a drop in query time for some
queries and in load on the central processing units."
Personally, I would have said the base relations should definitely not be
strongly redundant, but the stored ones might be (depending -- as always at the
storage level -- on performance considerations).
Codd says that, given a set of relations, the system should be informed of any
redundancies that apply to that set, so that it can enforce consistency; the set
will be consistent if it conforms to the stated redundancies. I should point out, however,
that this definition of consistency certainly doesn't capture all aspects of integrity,
nor does the concept of strong redundancy capture all possible kinds of redundancy. As a
simple counterexample, consider a database containing just one relation, for example EMP
{ EMP#, DEPT#, BUDGET }, in which the following functional dependencies are
satisfied:
EMP# -> DEPT#
DEPT# -> BUDGET
This database certainly involves some redundancy, but it isn't "strongly"
redundant according to the definition.
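The redundancy in question is the familiar transitive-dependency kind: every employee in a given department carries that department's budget, so the budget figure is recorded once per employee instead of once per department. A small sketch with invented data that checks the two FDs and exposes the repetition (the `satisfies_fd` helper is mine):

```python
def satisfies_fd(heading, body, lhs, rhs):
    """True iff the relation satisfies the FD lhs -> rhs: equal
    lhs-values must always go with equal rhs-values."""
    li = [heading.index(a) for a in lhs]
    ri = [heading.index(a) for a in rhs]
    seen = {}
    for row in body:
        key = tuple(row[i] for i in li)
        val = tuple(row[i] for i in ri)
        if seen.setdefault(key, val) != val:
            return False
    return True

heading = ("EMP#", "DEPT#", "BUDGET")
body = {("e1", "d1", 10), ("e2", "d1", 10), ("e3", "d2", 20)}

assert satisfies_fd(heading, body, ("EMP#",), ("DEPT#",))
assert satisfies_fd(heading, body, ("DEPT#",), ("BUDGET",))
# d1's budget of 10 is stored twice -- redundancy, even though no relation
# in this one-relation database is derivable from the others
```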
I should explain why Codd uses the term strong redundancy. He does so to
distinguish it from another kind, also defined in the paper, which he calls weak
redundancy. I omit the details here, however, because unlike just about every other
concept introduced in the first two papers, this particular notion doesn't seem to have
led to anything very significant. (In any case, the example given in the paper doesn't
even conform to Codd's own definition.) The interested reader is referred to the original
paper for the specifics.
Data Bank Control
This, the final section of the 1969 paper, offers a few remarks on protocols for what
to do if inconsistencies are discovered. "The generation of an inconsistency ...
could be logged internally, so that if it were not remedied within some reasonable time
... the system could notify the security officer [sic]. Alternatively, the system could
[inform the user] that such and such relations now need to be changed to restore
consistency ... Ideally, [different system actions] should be possible ... for different
subcollections of relations."
This concludes my discussion of the 1969 paper. Next time, I'll turn my attention to
that paper's more famous successor, the 1970 paper.